Infinity Embedding Node
The Infinity Embedding node converts text content into numerical vector representations using a self-hosted Infinity embedding server, which batches requests for high throughput. The server exposes an OpenAI-compatible API for easy integration and supports most Sentence Transformers models from HuggingFace. Its architecture enables efficient resource management and concurrent request handling.
How It Works
When the node executes, it reads text input from a workflow variable, validates the text content, and constructs OpenAI-compatible API requests with the specified model identifier. It sends the requests to the Infinity server over HTTP in batches, then stores the resulting embedding vectors, arrays of floating-point numbers, in the output variable. Each text input produces one embedding vector, with dimensionality determined by the model loaded on the Infinity server (e.g., BAAI/bge-small-en-v1.5 produces 384-dimensional vectors).
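For illustration, the request/response cycle follows the standard OpenAI-style embeddings call. The Python sketch below is a minimal approximation, not the node's internal code; the /embeddings path is an assumption (some server versions expose /v1/embeddings instead), and the requests library stands in for whatever HTTP client the node actually uses.

import requests

# Minimal sketch of the OpenAI-compatible call the node performs.
# Assumes an Infinity server at localhost:7997 with the model loaded.
response = requests.post(
    "http://localhost:7997/embeddings",  # assumed path; may be /v1/embeddings
    json={
        "model": "BAAI/bge-small-en-v1.5",
        "input": ["First document content", "Second document content"],
    },
    timeout=30,
)
response.raise_for_status()

# One embedding per input text; dimensionality depends on the loaded model.
vectors = [item["embedding"] for item in response.json()["data"]]
print(len(vectors), len(vectors[0]))  # e.g. 2 384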
Running embeddings through an Infinity server, rather than loading models directly in the workflow engine, provides optimized batching, dynamic model loading, and efficient resource management. The server handles multiple concurrent requests, batches embeddings for optimal GPU utilization, and manages model memory automatically. Because the node communicates through an OpenAI-compatible API, switching between Infinity and OpenAI embedding services is as simple as changing the endpoint URL.
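As an illustration of that compatibility, a stock OpenAI client can target either service purely through its base URL. This sketch uses the openai Python package; the base_url value and the placeholder API key are assumptions (a self-hosted Infinity server typically needs no key):

from openai import OpenAI

# Point the standard OpenAI client at a self-hosted Infinity server.
# Swap base_url and api_key for OpenAI's hosted service to switch providers.
client = OpenAI(base_url="http://localhost:7997", api_key="not-needed")

result = client.embeddings.create(
    model="BAAI/bge-small-en-v1.5",
    input=["First document content"],
)
print(len(result.data[0].embedding))  # 384 for bge-small-en-v1.5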
Each output embedding is correlated with its input item through a unique identifier: the input's id when provided, otherwise a generated UUID, so every embedding can be traced back to its source text. The node supports configurable batch sizes to balance throughput against memory usage, and failed embedding generation for individual items does not stop processing of the remaining items. The Infinity server must be running and accessible at the configured URL before workflow execution.
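The correlation and failure-isolation behavior can be approximated as follows. This is a hypothetical sketch, not the node's source: the embed_requests helper is invented for illustration, failures are isolated per batch here for brevity (the node isolates them per item), and the field names mirror the input/output structures documented below.

import uuid
import requests

def embed_requests(items, url, model, batch_size=32):
    """Sketch: embed request objects in batches, preserving order and IDs."""
    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        # Use the caller's id when present, otherwise generate a UUID.
        ids = [item.get("id") or str(uuid.uuid4()) for item in batch]
        try:
            resp = requests.post(
                f"{url}/embeddings",  # assumed path
                json={"model": model, "input": [i["text"] for i in batch]},
                timeout=60,
            )
            resp.raise_for_status()
            vectors = [d["embedding"] for d in resp.json()["data"]]
        except requests.RequestException:
            # A failed batch yields empty embeddings; later batches still run.
            vectors = [[] for _ in batch]
        results.extend(
            {"uuid": i, "embeddings": v} for i, v in zip(ids, vectors)
        )
    return results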
Configuration Parameters
Input Field
Input Field (Text, Required): Workflow variable containing text to embed.
The node expects a list of embedding request objects where each object contains a type field (set to "text"), an optional id field (string for tracking), and a text field (string content to embed). Single objects are automatically converted to single-item lists.
Example input structure:
[
  {"type": "text", "id": "doc1", "text": "First document content"},
  {"type": "text", "id": "doc2", "text": "Second document content"}
]
Output Field
Output Field (Text, Required): Workflow variable where embedding results are stored.
The output is a list of EmbeddingResponse objects where each object contains a uuid field (string identifier matching input ID or generated UUID) and an embeddings field (array of floating-point numbers). The list maintains the same order as the input. Empty embeddings are returned for failed generation attempts.
Example output structure:
[
  {"uuid": "doc1", "embeddings": [0.123, -0.456, 0.789, ...]},
  {"uuid": "doc2", "embeddings": [0.234, -0.567, 0.890, ...]}
]
Common naming patterns: text_embeddings, document_vectors, infinity_embeddings, server_embeddings.
Model
Model (Text, Required): Model identifier or HuggingFace path for the embedding model on the Infinity server.
Common models include BAAI/bge-small-en-v1.5 (384 dimensions, efficient) and sentence-transformers/all-MiniLM-L6-v2 (384 dimensions, fast). The model must be loaded on the Infinity server before use—the node does not load models automatically. Variable interpolation using ${variable_name} syntax is supported.
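A pre-flight check can confirm the model is actually loaded before the workflow runs. The /models path below follows the OpenAI listing convention and is an assumption; consult your Infinity version's documentation for the exact route:

import requests

# Hypothetical pre-flight check: list the models the server has loaded.
resp = requests.get("http://localhost:7997/models", timeout=10)
resp.raise_for_status()
loaded = {m["id"] for m in resp.json()["data"]}

assert "BAAI/bge-small-en-v1.5" in loaded, "model not loaded on the server"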
Infinity API URL
Infinity API URL (Text, Required): URL where the Infinity embedding server is running.
The URL should include the protocol (http:// or https://) and port number if non-standard. Common format is http://localhost:7997 for local deployments or https://infinity.example.com for remote servers. The server must be running and accessible before workflow execution. Variable interpolation is supported. HTTPS is recommended for production deployments to encrypt embedding data in transit.
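Checking reachability before execution catches misconfigured URLs early. The /health endpoint in this sketch is an assumption; any lightweight GET against the server serves the same purpose:

import requests

INFINITY_URL = "http://localhost:7997"  # or https://infinity.example.com

try:
    # Assumed health endpoint; any 2xx response confirms reachability.
    requests.get(f"{INFINITY_URL}/health", timeout=5).raise_for_status()
except requests.RequestException as exc:
    raise SystemExit(f"Infinity server unreachable at {INFINITY_URL}: {exc}")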
Batch Size
Batch Size (Number, Optional): Number of texts to process in each batch sent to the server.
Larger batch sizes (32, 64, 128) increase throughput by maximizing GPU utilization. Smaller batch sizes (8, 16) reduce memory usage and latency. The optimal size depends on available GPU memory, text length, and model size. Leave empty to use the server's default. Minimum value is 1 if specified.
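A pragmatic way to choose a value is to time a representative workload at several candidate sizes and pick the knee of the curve. This sketch reuses the hypothetical embed_requests helper from the How It Works section:

import time

texts = [{"type": "text", "text": f"sample document {n}"} for n in range(512)]

for batch_size in (8, 16, 32, 64, 128):
    start = time.perf_counter()
    embed_requests(texts, "http://localhost:7997",
                   "BAAI/bge-small-en-v1.5", batch_size=batch_size)
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size}: {len(texts) / elapsed:.1f} texts/sec")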
Common Parameters
This node supports common parameters shared across workflow nodes, including Stream Output Response, Streaming Messages, and Logging Mode. For detailed information, see Common Parameters.
Best Practices
- Deploy the Infinity server on GPU-equipped infrastructure for optimal performance; CPU-only generation is significantly slower
- Configure Batch Size based on workload: larger batches (64-128) for high-throughput batch processing, smaller batches (8-16) for low-latency interactive workflows
- Pre-load models on the Infinity server before workflow execution to avoid cold-start delays
- Use the same model for both document and query embeddings in search systems to ensure vector compatibility (see the sketch after this list)
- Monitor Infinity server resource usage (GPU memory, CPU, network) to identify bottlenecks and optimize batch sizes
- Store the Infinity API URL in workflow variables for easy switching between development, staging, and production servers
- Consider running multiple Infinity server instances with different models to avoid model switching overhead
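To illustrate the same-model recommendation above: embedding documents and queries with one model keeps all vectors in the same space, so cosine similarity between them is meaningful. A minimal sketch, again using the hypothetical embed_requests helper:

import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

URL, MODEL = "http://localhost:7997", "BAAI/bge-small-en-v1.5"
docs = embed_requests(
    [{"type": "text", "id": "doc1", "text": "Infinity serves embeddings"},
     {"type": "text", "id": "doc2", "text": "Bananas are yellow"}],
    URL, MODEL)
query = embed_requests([{"type": "text", "text": "embedding server"}], URL, MODEL)[0]

# Rank documents by similarity to the query; same model on both sides.
for doc in sorted(docs, key=lambda d: cosine(query["embeddings"], d["embeddings"]),
                  reverse=True):
    print(doc["uuid"], round(cosine(query["embeddings"], doc["embeddings"]), 3))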
Limitations
- External server dependency: The node requires a running Infinity embedding server. The workflow fails if the server is unreachable or not responding.
- Model pre-loading required: Models must be loaded on the Infinity server before use. The node does not load or manage models.
- Text-only support: The node only supports text embeddings. Requests with a type other than "text" fail even though the input format nominally accommodates other modalities.
- No authentication support: The node does not support authentication headers or API keys. Server authentication must be configured at the network or proxy level.
- Batch size constraints: Very large batch sizes may exceed server memory limits or timeout thresholds. Monitor server logs to identify optimal sizes.
- Network latency: Embedding performance depends on network latency between the workflow engine and Infinity server. Co-locate them when possible.